ggplot is a highly popular and powerful visualization tool in “R”.
When we learn a language, we need to know its components (verbs, adjectives, nouns, etc.) and where each component belongs in a sentence.
ggplot also has a grammar: it has components, and each component has its place.
Data : the dataset used for visualization
Aesthetic mapping : variables that are used in visualization (x, y)
Geometric Object : the type of graph we want to plot, called for by geom_x in ggplot. Examples are geom_point and geom_line.
Statistical Transformations : transformations of the data, called for by stat=. For instance, with stat=“identity” we ask R to plot the actual values of x and y. With stat = “summary” we ask R to plot a summary of the observations (e.g. mean, sd).
Coordinate System : the type of coordinates. We can use coord_x to call for different coordinate systems. For instance, coord_flip swaps the x and y axes, and coord_polar connects both ends of an axis to each other.
Position Adjustments : called for by position=. The three most commonly used positions are “identity”, where everything is placed exactly where it is; “stack”, where elements are stacked on top of each other; and “dodge”, where elements are placed side by side for a better view. Sometimes we can also use position = “fill”.
Faceting : to split graphs based on a categorical variable
Labeling, themes : how to make labels for data, axis, and titles and change their location or size.
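As a rough illustration of how these components fit together, the sketch below builds one plot and marks which line plays which role. It uses the mpg dataset that ships with ggplot2 as a stand-in (it is not the food dataset), and the titles and theme are arbitrary choices made for this example.

```r
library(ggplot2)

# Stand-in data: mpg ships with ggplot2 (an assumption for illustration only).
p_grammar <- ggplot(data = mpg,                      # data
                    aes(x = displ, y = hwy)) +       # aesthetic mapping
  geom_point(aes(color = factor(cyl)),               # geometric object
             stat = "identity",                      # statistical transformation
             position = "identity") +                # position adjustment
  coord_cartesian() +                                # coordinate system
  facet_wrap(~ drv) +                                # faceting
  labs(title = "Highway mileage by engine size",     # labeling
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       color = "Cylinders") +
  theme_minimal()                                    # theme

# Typing p_grammar at the console draws the plot.
```

Most of these components have defaults, so in practice only the data, the aesthetic mapping and a geom need to be written out.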
When you work with R Markdown, make sure you load the libraries you need and import the dataset inside the document.
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(plotrix)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:plotrix':
##
## rescale
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(ggthemes)
library(gganimate)
food <- read.csv("~/Dropbox/Github/food.csv")
The data set used here is called “food”. There are six variables in the dataset that are as follows:
First, we tell R to use the function ggplot, where the data used is food. Second, we tell R which variables are used in plotting. Here we tell R that variable X goes on the x-axis and variable Y goes on the y-axis.
Third, we tell R that the type of graph is a point graph (geom_point). Finally, we tell R that the observations should be distinguished by different colors based on variable Z.
ggplot(data=food, aes(x=X, y=Y)) + geom_point(aes(color= Z))
ggplot(data=food, aes(x=income, y=food_share))
The call above only sets up the axes. We also have to tell R the graph type (geom_).
geom_… sets the type of graph. We can add more than one geom layer to a graph.
ggplot(data=food, aes(x=income, y=food_share)) +
geom_point()
By using aes() we can tell R to distinguish between the data points based on the values of another variable. The differentiation can take the form of different colors, shapes, and sizes.
For instance, below we plot the scatterplot of food_share and income, and we use the variable spouse_work (i.e. whether the spouse works or not) to distinguish households where both partners work from households where only one person has a job.
ggplot(data=food, aes(x=income, y=food_share)) +
geom_point(aes(color=factor(spouse_work)))
# here we tell R that the color of the data points should differ by spouse_work.
We can also change the shape of the points
ggplot(data=food, aes(x=income, y=food_share)) +
geom_point(aes(color=factor(spouse_work), shape=factor(spouse_work)))
# here we tell R that both the color and the SHAPE of the data points should differ by spouse_work.
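The difference between mapping an aesthetic to a variable (inside aes()) and setting it to a fixed value (outside aes()) can be sketched as follows. The mpg dataset from ggplot2 is used as a stand-in for food, and the object names are made up for this example.

```r
library(ggplot2)

# Mapped: color depends on a variable, so it goes inside aes() and gets a legend.
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = factor(cyl)))

# Set: every point gets the same fixed color, so it goes outside aes().
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# layer_data() returns the data ggplot actually draws, including the colors used.
n_colors_mapped <- length(unique(layer_data(p_mapped)$colour))
n_colors_set    <- length(unique(layer_data(p_set)$colour))
```

Here n_colors_mapped equals the number of distinct cyl values, while n_colors_set is 1.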
There are several geoms that we can use. Which one is appropriate depends on the type of the variables being plotted (continuous or categorical).
Simple histogram
food %>%
ggplot(aes(x=income)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histogram: we add more information to make a better graph
food %>%
ggplot(aes(x=income)) +
geom_histogram(bins= 20,
fill="blue",
color="red")
Below we plot the histogram of the income variable again, adding more information to make a better graph. We want a histogram that distinguishes between households where both spouses work and households where only one person has a job.
We can also use position = “fill” to see the proportions. Other positions are “identity”, which shows the data as it is, and “dodge”, which presents the bins for each group side by side.
food %>%
na.omit() %>%
ggplot(aes(x=income)) +
geom_histogram(position = "stack",
bins= 30,
color="black",
aes(fill=factor(spouse_work)))
# change the position to identity, dodge or fill and see what happens
# change the number of bins
# change fill = factor(spouse_work) to fill = factor(age_group)
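To see what the position argument does numerically, the sketch below compares “stack”, “dodge” and “fill”. It is only an illustration: it uses geom_bar on the mpg dataset from ggplot2 rather than the histogram of food, but the position values behave the same way.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class, fill = drv))

p_stack <- base + geom_bar(position = "stack")  # counts piled on top of each other
p_dodge <- base + geom_bar(position = "dodge")  # groups placed side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked, then rescaled to proportions

# Top of the highest segment in each bar, taken from the drawn data.
d_stack <- layer_data(p_stack)
d_fill  <- layer_data(p_fill)
tops_stack <- tapply(d_stack$ymax, as.integer(d_stack$x), max)
tops_fill  <- tapply(d_fill$ymax,  as.integer(d_fill$x),  max)
```

With “stack” the top of each bar is the total count for that class, whereas with “fill” every bar reaches 1, so only the proportions can be compared.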
A density plot can be used to show the probability distribution of a continuous variable. Note that in interpreting a density plot we should consider the area under the curve and not a single data point.
A simple density plot
food %>%
ggplot(aes(x=income)) +
geom_density(fill="blue",
color="red",
alpha=0.6)
# change alpha values to change the level of transparency in the filled color.
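The point about the area under the curve can be checked with a small sketch. It uses the hwy variable of the mpg dataset from ggplot2 as a stand-in for income and approximates the area with the trapezoid rule; the helper names are made up for this example.

```r
library(ggplot2)

p_den <- ggplot(mpg, aes(x = hwy)) +
  geom_density(fill = "blue", alpha = 0.6)

# Points of the density curve as drawn by ggplot.
curve_pts <- layer_data(p_den)
curve_pts <- curve_pts[order(curve_pts$x), ]

# Trapezoid rule: sum of (width * average height) over neighbouring points.
widths  <- diff(curve_pts$x)
heights <- (head(curve_pts$y, -1) + tail(curve_pts$y, -1)) / 2
area <- sum(widths * heights)
```

The value of area should come out close to 1. It is typically slightly below 1 because the curve is only drawn over the range of the data.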
Density plot for households where both partners work vs. households where only one person works
food %>%
ggplot(aes(x=income)) +
geom_density(position="stack",
alpha=0.3,
aes(fill=factor(spouse_work)))
# change the position to "identity" or "fill"
food %>%
ggplot(aes(x=age_group)) +
geom_bar()
Add more information
food %>%
ggplot(aes(x=age_group)) +
geom_bar(color="green",
fill="gray")
food %>%
ggplot(aes(x=income, y= food_share)) +
geom_point(color="blue",
shape=6,
size=3)
# change the shape number from 6 to 5 or other numbers.
Distinguish between data points based on a third variable
food %>%
ggplot(aes(x=income, y= food_share)) +
geom_point(aes(color=factor(age_group), shape=factor(age_group))) +
theme_few()
# use theme_few (from the ggthemes package) or other themes for better visibility.
geom_smooth adds a fitted (smoothed) line to the plot. In geom_smooth we can choose the fitting method. We introduce two methods here: “loess” (locally estimated scatterplot smoothing) and “lm” (linear model). se = F tells R not to draw the confidence band around the fitted line.
#method = "loess"
food %>%
ggplot(aes(x=income, y= food_share)) +
geom_point(color="blue") +
geom_smooth(color="red",
method = "loess",
se=F)
#change se=F to se=T
# you can change the color as well.
# scatterplot and geom_smooth: method= "lm"
food %>%
ggplot(aes(x=income, y= food_share)) +
geom_point(aes(color=factor(spouse_work))) +
geom_smooth(method = "lm",
se=F,
aes(color=factor(spouse_work)))
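As a quick check of what method = "lm" draws, the sketch below compares the line produced by geom_smooth with the predictions of an ordinary lm() fit. It uses the mpg dataset from ggplot2 as a stand-in for food, and the object names are hypothetical.

```r
library(ggplot2)

p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Second layer = geom_smooth: the x/y points of the fitted line.
smooth_line <- layer_data(p_lm, 2)

# The same linear model fitted by hand, evaluated at the same x values.
fit <- lm(hwy ~ displ, data = mpg)
manual <- predict(fit, newdata = data.frame(displ = smooth_line$x))
```

The values in smooth_line$y and manual should agree, which shows that the drawn line is simply the fitted regression line.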
food %>%
ggplot(aes(x=income, y= food_share)) +
# size is a fixed value, so it is set outside aes() (inside aes() it would create a legend entry)
geom_point(aes(shape=factor(spouse_work),
color=factor(spouse_work)),
size=2,
alpha=0.7) +
geom_smooth(method = "lm",
se=F, size=3,
aes(color=factor(spouse_work)))
# plot of the average of income across age groups
food %>%
group_by(age_group ) %>%
summarize(mean_income= mean(income)) %>%
ggplot(aes(x=age_group,
y=mean_income)) +
geom_col(fill="blue",
color="red") +
geom_label(aes(label=round(mean_income, digits = 1)))
# here we use geom_label to add labels to the columns. We put label inside aes() because we are telling R to use a variable to make the labels. round(mean_income, digits = 1) rounds the labels to one decimal place.
# change digits to 2.
# remove the round.
# change the position to f
Above we used dplyr to compute the mean income across age groups. For bar and column graphs, we can instead use “stat” to statistically transform the data for plotting. We tell R that we want stat=“summary”, that is, a graph of summary values. By default, the summary calculated is the mean. However, we can use fun.y=sd for the standard deviation, sum for the sum, min for the minimum, max for the maximum, etc. In ggplot2 3.3.0 and later, the fun.y argument is called fun.
food %>%
ggplot(aes(x=age_group, y=income)) +
geom_bar(fill="blue", color="red",
stat = "summary", fun.y=mean)
# change the fun.y = mean to fun.y = sd
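The two routes (summarising with dplyr first, or letting ggplot summarise through stat = "summary") should give the same bar heights. The sketch below checks this on the mpg dataset from ggplot2, used here as a stand-in for food. It assumes ggplot2 3.3.0 or later, where the argument is named fun; with older versions such as the one loaded above, fun.y would be used instead.

```r
library(ggplot2)
library(dplyr)

# Route 1: summarise first, then draw the pre-computed means with geom_col().
means <- mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy))

p_col <- ggplot(means, aes(x = class, y = mean_hwy)) +
  geom_col()

# Route 2: hand ggplot the raw data and let stat = "summary" compute the means.
# Assumption: ggplot2 >= 3.3.0 (argument `fun`); older versions use `fun.y`.
p_stat <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_bar(stat = "summary", fun = mean)

# Bar heights computed by ggplot in route 2.
bar_heights <- layer_data(p_stat)$y
```

Comparing bar_heights with means$mean_hwy shows the same numbers, so the choice between the two routes is mostly a matter of convenience.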
The functions facet_grid and facet_wrap can be used to split a graph into several panels based on a categorical variable. For instance, here we tell R to make one panel for each value of spouse_work using facet_wrap(~ spouse_work); facet_grid(~ spouse_work) gives a similar result.
Also, we can use coord_polar to make a graph similar to a pie chart or radar chart. We can also use coord_flip to swap the x and y axes.
food %>%
group_by(age_group, spouse_work) %>%
summarise(mean_income = mean(income)) %>%
ggplot(aes(x=age_group, y=mean_income)) +
geom_col(aes(fill=factor(age_group))) +
facet_wrap(~spouse_work, ncol=2)
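A short sketch of the difference between the two faceting functions, using the mpg dataset from ggplot2 as a stand-in for food: facet_wrap lays out the panels of one variable in rows and columns, while facet_grid crosses a row variable with a column variable.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# One panel per value of drv, wrapped into two columns.
p_wrap <- base + facet_wrap(~ drv, ncol = 2)

# Rows by drv, columns by year: one panel per combination.
p_grid <- base + facet_grid(drv ~ year)

# Number of panels that contain data in each layout.
n_wrap <- length(unique(layer_data(p_wrap)$PANEL))
n_grid <- length(unique(layer_data(p_grid)$PANEL))
```

facet_wrap is usually the more convenient choice for a single variable with many levels, and facet_grid for two variables.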
### Line Graph
food %>%
ggplot(aes(age, y=income)) +
geom_line(aes(color=age_group)) +
facet_wrap(~spouse_work)
econ <- economics %>%
separate(date, c("y", "m", "d"))
econ$date <- economics$date
econ <- econ %>%
mutate(unemployment_rate= 100*unemploy/pop)
econ %>%
ggplot(aes(x=date, y=unemployment_rate)) +
geom_line(aes(color=y), size=1)
A box plot summarizes the distribution of a continuous variable with five numbers: the minimum, the first quartile (25%), the median, the third quartile (75%) and the maximum. It also shows outliers as separate points.
food %>%
na.omit() %>%
ggplot(aes(x=factor(age_group), y=income)) +
geom_boxplot(aes(fill=factor(age_group)))
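The numbers behind a box plot can be inspected directly. The sketch below, which uses the hwy variable of the mpg dataset from ggplot2 as a stand-in for income, pulls the box statistics out of the plot and compares them with quantile(). Note that in geom_boxplot the whiskers extend only to the most extreme points within 1.5 times the interquartile range of the box; points beyond that are drawn as outliers.

```r
library(ggplot2)

p_box <- ggplot(mpg, aes(x = "all cars", y = hwy)) +
  geom_boxplot()

# Whisker ends (ymin, ymax), box edges (lower, upper) and median (middle).
box_stats <- layer_data(p_box)[, c("ymin", "lower", "middle", "upper", "ymax")]

# The same quartiles computed by hand.
quartiles <- quantile(mpg$hwy, c(0.25, 0.5, 0.75))
```

The lower, middle and upper values match the hand-computed quartiles, while ymin and ymax can differ from the raw minimum and maximum whenever there are outliers.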
Using notch = T we can also compare the medians of groups: if the notches of two boxes do not overlap, this suggests that their medians are significantly different.
food %>%
ggplot(aes(x=factor(age_group), y=income)) +
geom_boxplot(notch = T,
aes(fill=factor(age_group)))
# add coord_flip()
food %>%
ggplot(aes(x=factor(age_group), y=income)) +
geom_boxplot(notch = T,
aes(fill=factor(age_group))) +
coord_flip()
#box plot and density plot
box <- ggplot(data= food, aes(y=income)) +
geom_boxplot() +
coord_flip() +
labs(x="Box\nplot")
den <- ggplot(data= food, aes(x=income)) +
geom_density(position = "stack") +
geom_vline(aes(xintercept = median(income)))
grid.arrange(box,den, ncol=1)
food %>%
ggplot(aes(x=factor(age_group), y=food_share)) +
geom_boxplot(notch = T,
aes(fill=factor(age_group))) +
coord_flip() +
theme_few()
food %>%
ggplot(aes(x=factor(age_group), y=food_share)) +
geom_violin(aes(fill=factor(age_group))) +
coord_flip() +
theme_classic()
Violin graphs show the distribution of the observations. Combining them with box plots can be informative.
food %>%
ggplot(aes(x=factor(age_group), y=income)) +
geom_violin(aes(fill=factor(age_group))) +
geom_boxplot(aes(fill=factor(age_group))) +
coord_flip() +
theme_classic()
# we can use theme_X to change the theme of a graph.
food %>%
ggplot(aes(x=factor(age_group), y=income)) +
geom_violin(aes(fill=factor(age_group))) +
geom_boxplot(notch = T,
aes(fill=factor(age_group))) +
coord_flip() +
theme_classic()
food %>%
ggplot(aes(x=factor(age_group), y=income)) +
geom_violin(fill="NA",
color="blue")+
geom_jitter(aes(color=factor(age_group))) +
theme_classic() +
coord_flip()
Error bars can be highly informative. Using dplyr and ggplot we can plot them: we use group_by and summarize with mean(.) and std.error(.), along with geom_col and geom_errorbar, to make the graph.
Below we plot the average income across age groups. First, we group the observations by age (age_group). Second, we summarise the variable income. Because we want error bars, we create two variables: mean, the average income, and sem, the standard error of the mean, using summarize(mean = mean(income), sem = std.error(income)). To calculate standard errors with std.error you need to install and load the package “plotrix”. Third, we use geom_col to draw a bar graph of average income across age groups. Fourth, we use geom_errorbar to draw the error bars. At this stage, we tell R that the lower end of each error bar is the mean minus sem (ymin = mean - sem) and the upper end is the mean plus sem (ymax = mean + sem). Note that mean +/- sem spans one standard error on each side of the mean; an approximate 95% confidence interval would be mean +/- 1.96 * sem.
food %>%
group_by(age_group) %>%
summarize(mean= mean(income),
sem = std.error(income)) %>%
ggplot(aes(x=age_group, y=mean)) +
geom_col(fill="NA",
size=1, color="red") +
geom_errorbar(aes(ymin=mean-sem,
ymax=mean+sem),
color=c("blue"))+
theme_few()
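The error bars above span one standard error on each side of the mean. If an approximate 95% confidence interval is wanted instead, a common normal approximation is mean +/- 1.96 * sem. The sketch below shows this on the mpg dataset from ggplot2, used as a stand-in for food. The helper sem_fun is hypothetical and written by hand so that the example does not depend on plotrix; it assumes there are no missing values.

```r
library(ggplot2)
library(dplyr)

# Hand-written standard error of the mean (assumes no missing values).
sem_fun <- function(x) sd(x) / sqrt(length(x))

summary_tbl <- mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy),
            sem = sem_fun(hwy)) %>%
  mutate(ci_low  = mean_hwy - 1.96 * sem,   # approximate 95% interval,
         ci_high = mean_hwy + 1.96 * sem)   # normal approximation

p_ci <- ggplot(summary_tbl, aes(x = class, y = mean_hwy)) +
  geom_col(fill = NA, color = "red") +
  geom_errorbar(aes(ymin = ci_low, ymax = ci_high),
                color = "blue", width = 0.3)
```

For small groups a t-based multiplier would be more appropriate than 1.96, so this should be read as a rough guide rather than an exact interval.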
Using geom_tile or geom_raster we can also make a heat map. Here we plot a heat map of average income. The two categorical variables used for the heat map are age_group and spouse_work.
We first use dplyr to compute the averages, then draw the plot below:
food %>%
group_by(age_group, spouse_work) %>%
summarize(mean= mean(income)) %>%
ggplot(aes(x= age_group, y=spouse_work)) +
geom_tile(aes(fill=mean), color="white")
A plot should speak for itself. If we see a graph with no text around it, we should be able to understand what the plot is about from its title, legend, labels, colors and shapes.
food %>%
ggplot(aes(x=income, y=food_share)) +
geom_point(color="blue", size=2, shape=9, alpha=0.5) +
labs(title = "Income and food_share", y= "food_share", x="income ")
food %>%
group_by(age_group) %>%
summarize(mean_income= mean(income),
sem = std.error(income)) %>%
ggplot(aes(x=age_group, y=mean_income)) +
geom_col(color="red", fill="white") +
geom_errorbar(aes(ymin=mean_income-sem,
ymax=mean_income+sem),
color=c("blue"))+
labs(title = "Average income across Age Groups",
subtitle = "UK Households",
y="Income",
x="Age Groups")
food %>%
group_by(age_group) %>%
summarize(mean_income= mean(income),
sem = std.error(income)) %>%
ggplot(aes(x=age_group, y=mean_income)) +
geom_col(color="red", fill="white") +
geom_errorbar(aes(ymin=mean_income-sem,
ymax=mean_income+sem),
color=c("blue")) +
labs(title = "Average income across Age Groups",
subtitle = "UK Households",
y="income",
x="Age Groups") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_text(size = 18,
color = "blue"))
### Increasing the number of ticks (breaks)
food %>%
group_by(age_group) %>%
summarize(mean_income= mean(income),
sem = std.error(income)) %>%
ggplot(aes(x=age_group, y=mean_income)) +
geom_col(color="red", fill="white") +
geom_errorbar(aes(ymin=mean_income-sem,
ymax=mean_income+sem),
color=c("blue")) +
labs(title = "Average income across Age Groups",
subtitle = "UK Households",
y="income",
x="Age Groups") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
axis.title.y = element_text(size = 18,
color = "blue")) +
scale_x_discrete(aes(limit=age_group)) +
scale_y_continuous(breaks = scales::pretty_breaks(10))
### Change legend title and its components
There are situations where we need to define the legend labels manually, and this can be done in ggplot with the scale_color_discrete function. Below we tell R to make a jitter plot and use the age group to color the points, so we use scale_color_discrete. If we had used fill=factor(age_group), we would use scale_fill_discrete instead. In scale_color_discrete, name=“Title” sets the legend’s title and labels = c(“A”, “B”) sets the name of each legend level.
food %>%
ggplot(aes(x=factor(age_group), y=income)) +
geom_violin(fill="NA", color="blue") +
geom_jitter(aes(color=factor(age_group))) +
labs(title = "Violin and Jitter Plots of Income\nacross Age Groups", y= "Income",
x="Age Group") +
scale_color_discrete(name = "Age Groups", labels= c("Group1 : 18 to 30 years old", "Group 2", "3rd Group", "Oldest Group"))
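When the legend comes from a fill aesthetic rather than a color aesthetic, the matching function is scale_fill_discrete. A minimal sketch, using the mpg dataset from ggplot2 as a stand-in for food (the legend title and labels are written for this example):

```r
library(ggplot2)

# The legend here belongs to fill, so it is relabelled with scale_fill_discrete.
# Labels are given in the order of the levels of drv: "4", "f", "r".
p_legend <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot(aes(fill = drv)) +
  scale_fill_discrete(name = "Drive train",
                      labels = c("Four-wheel", "Front-wheel", "Rear-wheel"))

# Typing p_legend at the console draws the plot.
```

Using scale_color_discrete here would have no visible effect on this legend, because no color aesthetic is mapped.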
Finally, you can make animations with gganimate. Here the data points are revealed progressively along age. This graph does not carry any real meaning; it is shown only as an example of how animation works.
food %>%
ggplot(aes(x=total_exp, y=food_share)) +
geom_point(aes(color=spouse_work, size=income)) +
transition_reveal(age) +
shadow_trail(distance = 0.1)